Introductions

About Biostatistics

HACD Biostatistics is one of several research elements comprising the Human Adaptation and
Countermeasures Division (HACD) at the Johnson Space Center. This laboratory provides statistical
consulting to HACD and the Space Medicine Health Care Systems Office (SMHCSO), provides opportunities
for high school and college students to be directly involved in the analysis and interpretation of biomedical
research at NASA, and conducts independent research to address the special challenges raised by the
idiosyncrasies of data often gathered on small numbers of human subjects under non-standard environments
and regimens.

Statistical Consulting

Biostatistics provides consulting expertise, mainly to the HACD research laboratories, in the application of
statistical theory and practice to ongoing biomedical research. Laboratory personnel often aid in the
preparation of parts of research proposals that describe the experimental design, statistical modeling and
subsequent analysis of anticipated research data. Once data are gathered, BSL statisticians also can assist
with analysis and interpretation of results to help the investigators extract the most information consistent
with the goal of maintaining statistical integrity. A BSL statistician may in fact be a co-investigator in projects
requiring sophisticated statistical modeling and/or analysis techniques and will be expected to contribute
descriptions of these techniques in forthcoming research papers. In these instances, the participating BSL
statistician would be included as a co-author of such papers. Being involved as a consultant to other
Bioastronautics research laboratories provides an excellent opportunity for the BSL statistician to expand
his/her knowledge base in such diverse medical fields as environmental physiology, osteopathy, neurology,
pharmacology, microbiology, cardiology, nutrition and psychology. Although HACD research laboratories
are the laboratory's main customer, consulting support is also provided to the SMHCSO in support of
NASA flight operations.

Outreach 

Although the primary customers for the BSL reside within the Office of Bioastronautics, statistical
consulting support is occasionally given to other organizations within the Johnson Space Center, such
as the Engineering Directorate and Human Resources and Education Office. The BSL also provides a venue
under which high school or college students, as summer interns, can be directly involved in the analysis
and interpretation of NASA biomedical research data.





The Universities Space Research Association's Division of Space Life Sciences (DSLS)
supports NASA's needs for understanding and counteracting the physiological changes that
accompany space flight. Based at USRA Houston, the DSLS manages extramural research
programs, administers educational programs, coordinates a visiting/staff scientist program, and
enhances collaboration between NASA and academic institutions through an extensive series of
conferences, workshops, and seminars. This USRA division was established in 1983 as the
Division of Space Biomedicine and facilitates participation of the university community in
biomedical research programs at the NASA Johnson Space Center (JSC).

The DSLS marked its 25th anniversary with a celebration on November 3, 2008 at USRA
Houston. Follow this link to a story and photos.

This site includes a video archive of talks presented at the UTMB/JSC Aerospace Medicine
Residency Program Space Medicine Grand Rounds seminar series. These streaming video
presentations require RealPlayer.

Proceedings of recent meetings and conferences coordinated by the DSLS are also included at
this site.

USRA Division of Space Life Sciences
3600 Bay Area Blvd., Houston, Texas 77058
Phone: 281-244-2000
Fax: 281-244-2006

For more information: 
info@dsls.usra.edu 

Last updated 
February 26, 2009

Built and Directed the Office of
Institutional Research & Assessment

- 2/1999 - 12/2001

Three Primary Functions of the Office 

- Institutional Research 

• Ex. Enrollment Management 

• Ex. Salary Studies 

- Institutional Assessment 

• General Education 

• Program Majors 

• Administrative Units 

- Institutional Data Warehousing & 
Reporting 

• SUNY/NYS 

• Middle States & Other Accrediting 
Bodies 

• The Usual Hodgepodge of others...






More Importantly, Who are YOU?

• How many Directors of IR Offices?

- New Directors? 

• How many Associate/Assistant Directors? 

- New to your position? 

• How many IR Analysts with 5+ years 
experience? 

• How many IR Analysts with less than 5 
years experience? 

• Other?? 


Purpose of This Module

• To provide Institutional Researchers with
an understanding of the principles of 
advanced research design and the 
intermediate/advanced statistical 
procedures consistent with such designs 


You Will Learn How To Use

• Independent Measures Analysis of
Variance (ANOVA)

• Repeated Measures ANOVA/MANOVA

• Analysis of Covariance (ANCOVA)

- ANOVA with Covariates

• Simple and Multiple Regression 

- Block Regression 

- Forwards, Backwards, Stepwise Regression 


You Will Also be Exposed To

• Exploratory Factor Analysis

- Principal-Axis Factoring 

• With Varimax Rotation 

• Time Series Regression 


I Assume That

• This isn’t your first course in statistics?

• That you have the basics covered 
- Foundations I Level Stats 

• That you have SPSS loaded on your 
laptop machines 

• That you are interested and motivated to 
learn! 


Format of This Module

• Hands-On!

- I’ll talk about the statistical tests, assumptions,
theory first

- Then we’ll walk through analyses together

• Using SPSS 

- We’ll pay a lot of attention to 

• Analytic choices that you have 

• Interpreting the output 

• Presenting the results to your constituents 


Quick Review:

Gaussian Distribution Function

A.K.A. The “Normal Distribution”

A.K.A. The “Bell-Shaped Curve”

Has known probabilities associated with it.

Thus all Parametric Statistics are based on the Gaussian Distribution

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \bar{x})^2}{2\sigma^2}}$$

where $\bar{x}$ = mean, and $\sigma$ = standard deviation

• About 68% of all scores fall within 1 SD unit from the mean.

• About 95% of all scores fall within 2 SD units from the mean.

• About 99.7% of all scores fall within 3 SD units from the mean.

[Figure: bell curve with the horizontal axis marked at -3σ, -2σ, -σ, 0, σ, 2σ, 3σ]
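The 68 / 95 / 99.7 figures can be checked directly from the normal curve. The following is a minimal sketch using only the Python standard library; the helper names `normal_cdf` and `share_within` are illustrative and do not come from the slides.

```python
import math


def normal_cdf(z: float) -> float:
    """Cumulative probability of a standard normal variable at z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def share_within(k: float) -> float:
    """Proportion of a normal distribution lying within k SD units of the mean."""
    return normal_cdf(k) - normal_cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {share_within(k):.4f}")
```

The three printed proportions round to about 68%, 95%, and 99.7%.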
Central Limit Theorem


• States that for any population with mean $\mu$
and standard deviation $\sigma$, the distribution
of sample means with sample size n will
approach a normal distribution with mean $\mu$ and
SD of $\sigma/\sqrt{n}$ as n approaches infinity.

• REGARDLESS of the shape of the 
distribution in the population. 

• By the time sample sizes hit around 30, 
sampling distribution of means is close to 
normal. 


Demo of central limit theorem.

[Screenshots: “Central Limit Theorem” demo applet run for three parent distributions (bimodal, exponential, normal) at small sample sizes (6 where legible). Each panel reports the mean and standard deviation of the sampled means and offers a “Show normal distribution” overlay.]
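For readers without the demo applet, a rough simulation makes the same point. This is a sketch under stated assumptions: the parent population is an exponential distribution with mean 1 and SD 1 (chosen here because it is strongly skewed), and the names `sample_means` and `draw_exponential` are hypothetical helpers, not part of the applet.

```python
import random
import statistics


def draw_exponential(rng: random.Random) -> float:
    """One score from a right-skewed parent population (mean 1, SD 1)."""
    return rng.expovariate(1.0)


def sample_means(draw, n: int, reps: int = 5000, seed: int = 1) -> list:
    """Means of `reps` random samples, each of size n, from the parent population."""
    rng = random.Random(seed)
    return [statistics.fmean([draw(rng) for _ in range(n)]) for _ in range(reps)]


for n in (2, 6, 30):
    means = sample_means(draw_exponential, n)
    observed_sd = statistics.stdev(means)
    predicted_sd = 1.0 / n ** 0.5  # sigma / sqrt(n), with sigma = 1
    print(f"n={n:2d}  mean of means={statistics.fmean(means):.3f}  "
          f"SD of means={observed_sd:.3f}  CLT prediction={predicted_sd:.3f}")
```

As n grows, the SD of the sample means should track $\sigma/\sqrt{n}$, and a histogram of `means` would look increasingly bell-shaped even though the parent population is skewed.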



Thus...

• Since we know so much about the Normal
Distribution

• And we know that sample summaries (means or
otherwise) tend to follow that distribution

- Even for data collected from non-normal populations

- Especially so with large sample size (big-n) 

• We can usually apply our knowledge of the 
normal distribution to statistical comparisons, 
estimates, and probability 

- As long as we do some preliminary screening... 


Ex. You may recall...

• That we can compare a person’s score to
their population with the “Z-Score”

- Z is a “standardization” 

• Mean = 0 

• SD = 1 

• Probability tables tell us percentiles, probabilities of 
being “that far” away from the mean. 


Z-score quick review 


• A student takes a standardized test and scores 
XXX. 

• Can compare their score to the population of all 
test-takers during that time, given population mean 
and standard deviation as: 

$$Z = \frac{X - \mu}{\sigma}$$, where

$\mu$ is population mean and $\sigma$ is population std dev



Z-score quick review 


• With their Z-score, we can glean 

- Their percentile rank [p(lower)] 

- Probability of scoring higher than them 

- Other relevant probabilities. 
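A short sketch of the Z-score lookup, using the standard library in place of a printed probability table. The score of 650 and the population mean of 500 and SD of 100 are made-up illustration values, and `z_score` and `percentile_rank` are hypothetical helper names.

```python
import math


def z_score(x: float, pop_mean: float, pop_sd: float) -> float:
    """Standardize a raw score: Z = (X - mu) / sigma."""
    return (x - pop_mean) / pop_sd


def percentile_rank(z: float) -> float:
    """p(lower): proportion of a normal population scoring below z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


z = z_score(650, pop_mean=500, pop_sd=100)   # hypothetical test score
p_lower = percentile_rank(z)
print(f"Z = {z:.2f}")
print(f"percentile rank p(lower) = {p_lower:.4f}")
print(f"probability of scoring higher = {1 - p_lower:.4f}")
```

The same two functions answer the other "relevant probabilities", for example the chance of falling between two scores is the difference of their percentile ranks.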



T-statistic for Comparing Sample to Population

• Where we don’t know SD of the population,
but we have sample data

- Thus sample mean and sd

• And we know from CLT that we can estimate
the standard error (SE) of the mean from the sample SD

$$SE = s_{\bar{X}} = \frac{s}{\sqrt{n}}$$


T-statistic for Comparing Sample to Population

• Given our sample data, we can calculate

- Sample mean

- Sample SD (s)

• Given SE formula

$$t = \frac{\bar{X} - \mu}{s_{\bar{X}}}, \qquad s_{\bar{X}} = \frac{s}{\sqrt{n}}$$

• We can calculate Confidence Intervals on
this Estimate also

• And with t-tables...p-values
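A worked sketch of the one-sample t-statistic and its confidence interval. The ten scores and the population mean of 50 are invented for illustration; the critical value 2.262 is the usual two-tailed .05 t-table entry for df = 9 and is hard-coded here rather than computed.

```python
from math import sqrt
from statistics import fmean, stdev

scores = [52, 55, 48, 60, 57, 51, 49, 58, 54, 56]  # hypothetical sample, n = 10
POP_MEAN = 50.0      # hypothetical population mean to compare against
T_CRIT_DF9 = 2.262   # two-tailed alpha = .05 critical t for df = 9 (from a t-table)

n = len(scores)
sample_mean = fmean(scores)
se = stdev(scores) / sqrt(n)              # SE = s / sqrt(n); stdev uses n - 1
t_stat = (sample_mean - POP_MEAN) / se    # t = (sample mean - mu) / SE
ci_low = sample_mean - T_CRIT_DF9 * se    # 95% confidence interval on the mean
ci_high = sample_mean + T_CRIT_DF9 * se

print(f"mean = {sample_mean:.2f}, SE = {se:.3f}, t({n - 1}) = {t_stat:.3f}")
print(f"95% CI: {ci_low:.2f} to {ci_high:.2f}")
print("reject the null at .05" if abs(t_stat) > T_CRIT_DF9 else "retain the null at .05")
```

The interval and the test agree: when the 95% interval excludes the population mean, the two-tailed test at alpha = .05 rejects.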


Moving to the t-test for comparing two samples

• Used for comparing two samples collected
randomly from two populations

• Fairly simple modifications of the t-statistic
comparing sample mean to population mean

         Sample 1    Sample 2
         X = 0       X = 4
         X = 2       X = 6
         X = 4       X = 8
mean     2           6
s        2           2

$$t = \frac{\bar{X}_1 - \bar{X}_2}{s_{\bar{X}_1 - \bar{X}_2}}$$

where

$$s_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}}$$

and

$$s_p^2 = \frac{SS_1 + SS_2}{df_1 + df_2} = \frac{df_1 s_1^2 + df_2 s_2^2}{df_1 + df_2}$$
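The pooled-variance formula can be traced step by step on the two small samples from this slide (0, 2, 4 and 4, 6, 8). This is a sketch only; `pooled_t` is an illustrative name, and the critical value in the comment is the standard two-tailed .05 t-table entry for df = 4.

```python
from math import sqrt
from statistics import fmean


def pooled_t(sample1, sample2):
    """Independent-samples t using the pooled variance; returns (t, df)."""
    m1, m2 = fmean(sample1), fmean(sample2)
    ss1 = sum((x - m1) ** 2 for x in sample1)   # SS inside sample 1
    ss2 = sum((x - m2) ** 2 for x in sample2)   # SS inside sample 2
    df1, df2 = len(sample1) - 1, len(sample2) - 1
    pooled_var = (ss1 + ss2) / (df1 + df2)      # s_p^2
    se_diff = sqrt(pooled_var / len(sample1) + pooled_var / len(sample2))
    return (m1 - m2) / se_diff, df1 + df2


t, df = pooled_t([0, 2, 4], [4, 6, 8])
print(f"t({df}) = {t:.3f}")
# The two-tailed .05 critical value for df = 4 is 2.776 (t-table),
# so with only three scores per sample this difference falls short.
```

With such tiny samples the standard error is large, which is why a four-point mean difference is still not significant.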




Dissect the formula:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{s_{\bar{X}_1 - \bar{X}_2}}$$


Dissect the formula: Numerator

$\bar{X}_1 - \bar{X}_2$: The difference between two sample means


Dissect the formula: Denominator

$s_{\bar{X}_1 - \bar{X}_2}$: Divided by some measure of standard
error of the differences


Dissect the formula: Question?

• The difference between two sample means

• Divided by some measure of standard
error of the differences



T-tests on the Computer: 


• Software gives us t-score and a p-value 

• Allowing us to test hypotheses that the two 
samples come from the same population 
(or not) 

• And describe the magnitude of the 
differences (confidence intervals) 

• Ex. t = 4.87, p<.001

- H_null: Two samples are from same population

- H_alt: Two samples are from different
populations

• Reject the Null (p < alpha = .05) & Report the
magnitude of the differences


Virtues of the t-test

• EVERYONE seems to understand it!

• With CLT, it’s easy to apply to lots of
different data scenarios

• There are other versions that make it very
flexible

- Formula for “Repeated Measures” designs

- Formula for problems associated with non-normality
and/or variance heterogeneity


Limitations of t-tests

• Alpha risk is .05 for each t-test

- Probability of falsely rejecting the null, and 
concluding that there is a difference, when it’s 
really due to chance. 

- So comparing 3, 4, 5 or more groups is quite 
problematic! 

• With large samples, as with ANY statistical 
test, “significance” does not necessarily 
indicate a meaningful difference. 


Comparing Three Groups

[Figure: Group 1, Group 2, and Group 3 compared pair by pair with separate t-tests (e.g., “T-test number 2”), each carrying its own alpha risk = .05]
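How quickly the alpha risk piles up can be sketched with a little arithmetic. The calculation below assumes the pairwise tests are independent, which is not strictly true when they share groups, so treat the result as an approximation; `familywise_alpha` is an illustrative name.

```python
from math import comb


def familywise_alpha(groups: int, alpha: float = 0.05):
    """Number of pairwise t-tests among `groups` groups and the approximate
    chance of at least one false rejection across all of them."""
    tests = comb(groups, 2)                      # every pair of groups
    return tests, 1 - (1 - alpha) ** tests       # assumes independent tests


for k in (3, 4, 5):
    tests, risk = familywise_alpha(k)
    print(f"{k} groups -> {tests} t-tests -> overall alpha risk of about {risk:.3f}")
```

Three groups already push the overall risk to roughly .14, and five groups to roughly .40, which is the motivation for a single omnibus test.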



Analysis of Variance (ANOVA) 


• Can compare unlimited number of groups or 
occurrences, and still keep alpha risk = .05 

• Able to take multiple grouping (or time) factors 
into account and determine their independent 
and combined effects 

• Can examine “trends” in data, and can test 
specific (often complex) hypotheses 

• The analytic focus is on variance, but the 
interpretation falls back to means — thus results 
become intuitive 


Assumptions Required of ANOVA

• Data collected randomly from the
population, with roughly equal n per cell

- And sufficiently large n (n > 30 is a common rule of thumb)

• Data measured on interval or ratio scale,
and is normally distributed

• Homogeneity of variance across groups

• Sphericity for RM designs: the variance of the
difference scores between any pair of
occasions is equal to that of any other pair


Assumption of Randomly Collected Data 
with Sufficiently Large n 


• In IR, we don’t always “randomly select” 

- But can we assume that “today’s” data is a random 
representation of “recent years?” 

- Or can we START randomly selecting a subset of 
your populations for research? 

• How big is big enough? 

- Rule of Thumb. . . at least 30 per group 

- More is better 

• Cautions about overpowered studies... 

- But BALANCE is critical!! 

• Rule of thumb: smallest group should not be less than 1/3rd
the size of the largest group.


Assumption of Interval or Ratio Scale &
Normality

• The “bell-shaped” curve: assumption of
all parametric statistics

• Studies show that ANOVA is robust to
violations of this, but only if sample size is
substantially large, and Homogeneity is
met

[Figure: bell curve with the horizontal axis marked at -3σ, -2σ, -σ, 0 (x̄), σ, 2σ, 3σ]


Assumption of Homogeneity of Variance 
Across Groups 


• Variance on the dependent variable should be similar
across groups

- Why?

• Because we’re examining VARIANCE in ANOVA, and so
we need the variance in each group to be roughly similar
before we can conclude that any differences that we find
are attributable to group differences (not mere variability
differences).

• Even in Means Comparisons (e.g., t-tests), since Means
are highly affected by variability, we need variability
to be similar in our groups so that differences that we
find can be attributed to true group differences, and not
merely to variability differences between our groups.


More on Homogeneity of Variance

• If the distribution is normal in one group, then it
should be for all

• If the distribution in one group is leptokurtic (tall
and skinny), then it should be for all other groups

• If the distribution in one group is platykurtic (short
& fat), then it should be for all other groups

• Any Mismatch is a Problem

- Because we might attribute a statistical difference to
real group differences, when it’s actually due to
heterogeneity of variance

• ...Thankfully SPSS will test this assumption for us
(stay tuned)




What about skewed data?

• Positive or negative skews in the data can
wreak havoc with statistical analysis

- Thus always recommend thorough data
screening

- Identify outliers — data entry errors? 

- Consider data transformations if necessary 

• More on this later 



Two General Types of ANOVA 


• Independent Measures ANOVA (IM-ANOVA) 

- Data are collected from separate groups of subjects, and 
comparisons among groups are desired 

• Student GPA by MAJOR 

• Faculty Salaries by DEPARTMENT 

• Repeated Measures ANOVA (RM-ANOVA) 

- Data are collected from the same group of subjects on multiple 
occasions/times, and comparisons of occasions are desired. 

• Longitudinal Studies 

• Student Opinions as Fresh, Soph, Jr, Sr 

• Alumni Donations after 1, 3, 5, 7 years post-graduation 


IM & RM Designs...

[Figure: design diagram; the surviving label reads “Repeated Measures Designs”]









One-Way IM-ANOVA 


• For comparing two or more populations

- Where sample data have been collected

Population 1 (μ = ?)    Population 2 (μ = ?)    Population 3 (μ = ?)

Sample 1    Sample 2    Sample 3
0           1           4
2           4           6
4           7           8
x̄ = 2       x̄ = 4       x̄ = 6


ANOVA: What’s in a Name? 





Total Variability

- Between Group Variability
  • Individual Differences (ID's)
  • Error
  • REAL GROUP DIFFERENCES!

- Within Group Variability
  • Individual Differences
  • Error







Analysis of Variance F-Ratio

• ANOVA is truly an analysis of a measure
of variability, called “variance”

- Within-Groups Variability

- Between-Groups Variability

• We Evaluate an “F-Ratio” Representing
the Ratio of B/T over W/I:

$$F = \frac{\text{variability between groups}}{\text{variability within groups}} = \frac{\text{ID's} + \text{error} + \text{group differences}}{\text{ID's} + \text{error}}$$


Recall your Simple Algebra. . . 


If the same quantity exists in the 
Numerator and Denominator of a fraction, 
they “cancel each other out” 


The F-Ratio

Assuming homogeneity of variance:

$$F = \frac{\text{variability between groups}}{\text{variability within groups}} = \frac{\text{ID's} + \text{error} + \text{group differences}}{\text{ID's} + \text{error}}$$


Recall your Simple Algebra...

• If the same quantity exists in the
Numerator and Denominator of a fraction,
they “cancel each other out”

• Leaving us with a number (F) that
represents Group Differences!

• Strictly speaking, added terms do not cancel; the ratio works out to
F = 1 + (group differences) / (ID's + error), so F sits near 1 when there
are no group differences and grows as they increase.


Analysis of Variance F-Ratio

• If F=1 ...

• As F increases

[Figure: F(4,12) distribution; 5% of the distribution is greater than 3.26]

[Figure: F(10,100) distribution; 5% of the distribution is greater than 1.93]

• How do you know if F is “big enough” to be
considered significant?

- How do you know a t-test is significant??


Confidence Intervals with the F-test

• CIs for comparing two groups are
straightforward and intuitive

• CIs for “Omnibus” differences are less so

- Effect size calculations exist, but are non-intuitive
to the statistically naive

• Stay tuned for discussions about post-hoc 
tests, and how they can sometimes help 

• Plots will also be very informative 


IM-ANOVA Summary Tables

• Purpose is to provide the necessary
components of the F-test

- Variability (SS): Sum of Squared Deviations from the Mean

- Degrees of Freedom (df): Like in a t-test, each F-test has df
values for significance testing

- Mean Square (MS): MS is the Variance Statistic for
ANOVA, calculated with SS & df

- F-statistic (F): The "F" statistic is another word for
the F-ratio

- Probability values associated with F: ...and p values tell us the
significance level of the ratio

• Total, Between Groups, Within Groups


This is what it looks like... 


                         df    SS    MS     F      p
Between Groups           ##    ##    ###    #.#    .##
Within Groups (error)    ##    ##    ###


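This is where it comes from (Independent
Measures Designs)

$$SS_{total} = \sum X^2 - \frac{(\sum X)^2}{N} \qquad df_{total} = N - 1$$

$$SS_{between} = \sum n_k (\bar{X}_k - \bar{G})^2 \qquad df_{between} = k - 1$$

$$SS_{within} = \sum SS_{\text{inside each group}} \qquad df_{within} = N - k$$

where N is the total number of scores, k is the number of groups, $n_k$ and $\bar{X}_k$ are the size and mean of each group, and $\bar{G}$ is the grand mean.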


This is where it comes from (Independent
Measures Designs)

$$MS_{total} = \frac{SS_{total}}{df_{total}}$$

$$MS_{between} = \frac{SS_{between}}{df_{between}}$$

$$MS_{within} = \frac{SS_{within}}{df_{within}}$$

$$F = \frac{MS_{between}}{MS_{within}}$$

F-tables provide a p value for a
given F-statistic, using df between
(numerator) and df within
(denominator).
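To see the summary-table pieces come together, here is a hand calculation on the three small samples from the One-Way IM-ANOVA slide: (0, 2, 4), (1, 4, 7), and (4, 6, 8). It is a sketch using only the standard library; `one_way_anova` is an illustrative helper, not SPSS output, and the p value would still come from an F-table or software.

```python
from statistics import fmean


def one_way_anova(groups):
    """Independent-measures ANOVA components for a list of groups of scores."""
    scores = [x for g in groups for x in g]
    n_total, k = len(scores), len(groups)
    grand_mean = fmean(scores)
    ss_total = sum((x - grand_mean) ** 2 for x in scores)
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum(sum((x - fmean(g)) ** 2 for x in g) for g in groups)
    df_between, df_within = k - 1, n_total - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_total": ss_total, "ss_between": ss_between, "ss_within": ss_within,
        "df_between": df_between, "df_within": df_within,
        "ms_between": ms_between, "ms_within": ms_within,
        "F": ms_between / ms_within,
    }


result = one_way_anova([[0, 2, 4], [1, 4, 7], [4, 6, 8]])
for name, value in result.items():
    print(f"{name:>11}: {value:.3f}")
```

Note that SS between and SS within add up to SS total, which is the partition shown in the "What's in a Name?" diagram.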


Example 1

• Compare Faculty Ratings Across 3
Departments

- History 

- Psychology 

- Math 

• Simplest of ANOVA Models, with ONE 
Independent Factor (department) 



The Data:

[Screenshot: SPSS Data Editor, *IREX1.sav [DataSet2], Variable View, with the Value Labels dialog open]

Name    Type       Width    Label                           Values               Missing    Columns    Align    Measure
dept    Numeric    11       Department                      {1, Psych 101}...    None       12         Right    Nominal
eval    Numeric    11       Student eval of Intro course    None                 None       12         Right    Nominal



One-Way Point-n-Click:

[Screenshots: SPSS Data Editor, *IREX1.sav [DataSet2], with the One-Way ANOVA dialogs open]

• One-Way ANOVA dialog: Dependent List (buttons: OK, Paste, Reset)

• One-Way ANOVA: Post Hoc Multiple Comparisons

- Equal Variances Assumed: LSD, Bonferroni, Sidak, Scheffe, R-E-G-W F, R-E-G-W Q, S-N-K, Waller-Duncan

- Equal Variances Not Assumed: Tamhane's T2

• One-Way ANOVA: Options (Statistics)

- Descriptive

- Fixed and random effects

- Homogeneity of variance test

- Brown-Forsythe

- Welch


Syntax method: 

ONEWAY 
eval BY dept 

/STATISTICS DESCRIPTIVES HOMOGENEITY 
/PLOT MEANS 
/MISSING ANALYSIS 

/POSTHOC = TUKEY BONFERRONI SIDAK GH ALPHA(.05). 








One-Way ANOVA Output

[Screenshot: SPSS Viewer, oneway_ANOVA.spo. The outline pane lists Oneway (Title, Notes, Descriptives, Test of Homogeneity of Variance, ANOVA, Post Hoc Tests with Multiple Comparisons, Homogeneous Subsets, Means Plots) and Univariate Analysis of Variance.]


Oneway-IR Example #1: Oneway Independent-Measures ANOVA

Notes

Output Created: 12-JUL-2005 15:24:11
Comments:
Input
- Data: U:\AIR Stat Institutes & book chapters\AIR Stat Institute 2005\SPSSStuff\IREX1.sav
- Filter: <none>
- Weight: <none>
- Split File: <none>
- N of Rows in Working Data File: 253
Missing Value Handling
- Definition of Missing: User-defined missing values are treated as missing
- Cases Used: Statistics for each analysis are based on cases with no missing data for any variable in the analysis.
Syntax:
ONEWAY
eval BY dept
/STATISTICS DESCRIPTIVES HOMOGENEITY
/PLOT MEANS
/MISSING ANALYSIS
/POSTHOC = TUKEY ALPHA(.05).
Resources
- Elapsed Time: 0:00:00.20

NOTE the boldface, red font above. SPSS normally "reduces" the Notes output, but it's handy to
open it back up if you forget the exact syntax that was used to execute a statistical procedure. Even if you
don't always run stats from syntax, it's nice to see that you COULD duplicate an earlier analysis perfectly by
copying the syntax from the Notes output. NOTE also that the file location and name are shown here too.


Descriptives

Student eval of Intro course

                                                      95% Confidence Interval for Mean
             N     Mean    Std. Deviation    Std. Error    Lower Bound    Upper Bound    Minimum    Maximum
Psych 101    86    6.65    1.713             .185          6.28           7.02           4


Two-Factor ANOVAs


• What if you want to compare 2+ groups on 

MORE THAN one factor? 

- Effect of students’ gender and race/ethnicity 
on performance? 

- Effect of students’ major and high school on 
cGPA? 

- Effect of faculty members’ level and age on 
job satisfaction ratings 


Two-Factor ANOVA Effects 


• Main Effects 

- One per factor... an F-statistic evaluating the impact of 
each factor in the model 

• Gender effect on performance (M/F diffs?) 

• Race/ethnicity effect on performance 

• Interaction Effects 

- One per interaction... an F-statistic evaluating how 
two (or more) factors interact with one another to 
affect the outcome 

• Gender “by” Race/Ethnicity interactive effects on 
performance 

• More complex... often more interesting! 


Example 2 


• Compare first term GPA by Major and
Citizenship

- U.S. versus non-U.S.

- Three Majors: Math, Business, US History

• 2 x 3 ANOVA



The Data

[Screenshots: SPSS Data Editor, IREX2.sav [DataSet3], Data View and Variable View, with the Value Labels dialogs open]

Variables: stu_id (Numeric, Scale), major (Numeric, Ordinal), cit (Ordinal), term1gpa (Scale)

Value labels for major: 1 = "Math", 2 = "Business", 3 = "US History"

Value labels for cit: 1 = "U.S.", 2 = "Other"

Sample rows:

stu_id    major    cit      term1gpa
1         Math     U.S.     2.55
25        Math     Other    3.6
26        Math     Other    3.4


Repeated Measures ANOVA

• Only one sample

• Differences based on time, or condition

• Using the SAME subjects time after time

[Figure: one Population, one Sample measured repeatedly]


Repeated Measures ANOVA

• Same people... no “individual differences” in the F-ratio

• More powerful statistics

The F-Ratio

$$F = \frac{\text{variability between conditions/times}}{\text{variability within the sample}} = \frac{\text{error} + \text{condition/time differences}}{\text{error}}$$


RM-ANOVA Summary Tables 




• Same Concept as IM Table, but now 

- Instead of “Between Groups” effects, we have 
“Between Treatments” effects 

- And also “Within Treatments” 

• Consist of subject differences (b/t subjects) 

• And error 

• One Group measured several times, thus we 
partition “within group” variability into that which 
is due to individual differences, and error. 


This is where it comes from
(Repeated Measures Designs)

$SS_{total}$ = Same as IM ANOVA

$SS_{between}$ = Same as IM ANOVA

$SS_{within}$ = Same as IM ANOVA

$$SS_{b/t\ subjects} = \sum \frac{(\text{each person's total across treatments})^2}{k} - \frac{(\sum X)^2}{N}$$

$$SS_{error} = SS_{within} - SS_{b/t\ subjects}$$

$$df_{total} = N - 1$$

$$df_{between} = k - 1$$

$$df_{within} = N - k$$

$$df_{b/t\ subjects} = n - 1$$

$$df_{error} = (N - k) - (n - 1)$$


This is where it comes from
(Repeated Measures Designs)

$$MS_{between} = \frac{SS_{between}}{df_{between}}$$

$$MS_{error} = \frac{SS_{error}}{df_{error}}$$

$$F = \frac{MS_{between}}{MS_{error}}$$

F-tables provide a p value for a
given F-statistic, using df between
(numerator) and df error
(denominator).
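The repeated-measures partition can be followed on a tiny made-up data set: three hypothetical subjects, each measured on three occasions (rows are subjects, columns are occasions). The numbers are invented for illustration, and `rm_anova` is a hypothetical helper that assumes complete data with no missing scores.

```python
def rm_anova(scores):
    """Repeated-measures ANOVA components; scores[i][j] = subject i at occasion j."""
    n = len(scores)                 # number of subjects
    k = len(scores[0])              # number of occasions / treatments
    n_total = n * k                 # N, total number of scores
    flat = [x for row in scores for x in row]
    correction = sum(flat) ** 2 / n_total                  # (sum X)^2 / N
    ss_total = sum(x * x for x in flat) - correction
    occasion_totals = [sum(row[j] for row in scores) for j in range(k)]
    ss_between = sum(t * t for t in occasion_totals) / n - correction
    ss_within = ss_total - ss_between
    ss_subjects = sum(sum(row) ** 2 for row in scores) / k - correction
    ss_error = ss_within - ss_subjects                     # individual differences removed
    df_between = k - 1
    df_error = (n_total - k) - (n - 1)
    return {
        "ss_between": ss_between, "ss_within": ss_within,
        "ss_subjects": ss_subjects, "ss_error": ss_error,
        "df_between": df_between, "df_error": df_error,
        "F": (ss_between / df_between) / (ss_error / df_error),
    }


# Hypothetical ratings: each row is one subject across three occasions.
result_rm = rm_anova([[0, 1, 4], [2, 4, 6], [4, 7, 8]])
for name, value in result_rm.items():
    print(f"{name:>11}: {value:.3f}")
```

In this made-up example most of the within-treatment variability belongs to stable subject differences, so removing it leaves a small error term and a much larger F than an independent-measures analysis of the same nine scores would give.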


Example 3 


• Compare student satisfaction ratings over 
time (four time points) 

- Freshman 

- Sophomore

- Junior

- Senior 

• Same students... different times 




Using Covariates in ANOVA 


• Sometimes the apparent effects of one 
factor, can be “explained” by the effects of 
some other factor — called a covariate 

• There is a significant relationship between 
panty hose wearing behavior and a form of 
cancer. . .any ideas what kind of cancer? 


Using Covariates in ANOVA 


1 


• Let’s look at the Faculty Salaries from a 
fictitious Biology Department: 

- Comparing Salaries by Gender and Tenure 
Status 

• A simple 2-factor ANOVA 

-Then considering AGE as a possible 
covariate in our model 

• In other words, covary out the effect of age and see if the salary differences remain...


Example 4 




• Describe effects of sex and tenure on 
faculty salaries for Biology faculty 

- Sex (male, female) 

-Tenure (tenured, untenured) 

- Use Age as a covariate 

• 2 (sex) x 2 (tenure) Model 

- With a covariate 


Related Advanced Topics 




ANOVA can handle multiple factors 

- More than most humans can understand! 

Even just three factors can produce SEVEN effects! 

- 3-way interaction 

- A*B 2-way interaction 

- A*C 2-way interaction

- B*C 2-way interaction 

- A main effect 

- B main effect 

- C main effect 

• Care to interpret that? 

Four factors = 15 effects
Five factors = 31 effects!!

$2^n - 1$ effects for n factors
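The count follows from listing every non-empty subset of the factors. A small sketch (the helper name `anova_effects` is made up for illustration):

```python
from itertools import combinations

def anova_effects(factors):
    """List every main effect and interaction for the given factor names."""
    effects = []
    for size in range(1, len(factors) + 1):
        for combo in combinations(factors, size):
            effects.append("*".join(combo))
    return effects

three = anova_effects(["A", "B", "C"])
four = anova_effects(["A", "B", "C", "D"])
five = anova_effects(["A", "B", "C", "D", "E"])
print(three)
print(len(three), len(four), len(five))
```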


Three-Factor Example in 
Monograph: 

Salary by 

Sex, Tenure, & Department 



Related Advanced Topics 


1 


• Mixed-Model ANOVAs 

• It is possible to consider both types of 
factors in a single model 


- Student satisfaction over time and by major 

- Student performance over time and by 
teaching modality 




Mixed-Model Quick Demo: 
Intellectual Growth 
by College over time


Related Advanced Topics 


1 


• Simple Non-Parametrics 

- Chi-Square (best for 2x2’s; purely categorical 
outcomes) 

- Mann-Whitney U (good for 2 groups; rank or ordinal 
data) 

- Kruskal-Wallis H (if >2 groups, like Oneway ANOVA) 

• Loglinear ANOVA 

- Useful for more complicated multi-factor designs 

• Recommend additional training & reading 
- Alan Agresti’s text is a good one


New Advances 


i 

• Hierarchical Modeling (a.k.a. HLM, MLM) 

- Maximum Likelihood based, for using random 
(i.e. not fixed) factors 

- Assessing impact of “layers” of grouping 
factors 

• Observations within person 

• Person within group 

• Group within larger group 

- Much better at accommodating for missing 
data 

- Many variations for different distributions 

- Unfortunately, SPSS is not well-suited 


Foundations II Institute: The Advanced Practice 

of Institutional Research 


Day 2: The saga continues... 


Association for Institutional Research
Foundations II Institute 2009

Shift Gears: Predicting “Y” from “X” (or several “Xs”)


• Z-tests, T-tests, ANOVA

- All for comparing groups, or observations over time



Now we’ll shift gears and talk about 
Regression Analysis 



Data Relationships with Two Variables 




• Data Relationships with Two Variables 

• Getting a Visual on relationships 

- Quantify relationships (Pearson’s r) 

• Strength 

• Significance 

- Background leading into Simple Regression 


Are Two Variables Related? 


• SES and GPA? 

• High School GPA and First Term College GPA? 

• SAT Scores and GPA? 

• ACT Scores and GPA? 

• SAT and ACT Scores? 

• We’re looking at PAIRS of variables 

• And we’re assessing relationships (non-causal) 


NOTE THAT 




• We are not comparing means from 2 
different groups here! 

• We are trying to see if there is a relationship between two different variables

- Usually continuous in scale 


We could look at a table of numbers? 




Any relationship 
between X and Y 
here? 


[Table: several side-by-side columns of paired X and Y values; the column layout did not survive extraction]


Scatterplots help us visualize... 


• Two continuous variables, X and Y

• Scatterplot shows association of X and Y
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Variable X 


Recall basic Correlational Analysis 




• Two continuous 
variables, X and Y 

• Scatterplot shows 
association of X 
and Y 

• Dots represent 
observations 
(usually people) 
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(X„ Y.) 


Variable X 


Quantifying the Relationship: Recall 
Linear Correlation 


Measures the direction and strength of the 
linear relationship b/t two variables. 

$r = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$


• Sign tells you what? 

• R-value tells you what? 

• P-value tells you what? 

• Causality? 
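As a rough illustration of the questions above, the sketch below computes r as the average product of z-scores, assuming $s_x$ and $s_y$ are sample standard deviations. The data are illustrative and `pearson_r` is a name chosen here, not something from the slides.

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson correlation as the sum of z-score products over n - 1."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)  # sample standard deviations
    return sum(
        ((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y)
    ) / (n - 1)

x = [1, 2, 3, 4, 5, 6]
y = [3, 2, 8, 8, 11, 13]

r = pearson_r(x, y)                 # positive: y tends to rise with x
r_reversed = pearson_r(x, y[::-1])  # same strength, opposite sign
print(round(r, 4), round(r_reversed, 4))
```

The sign gives the direction of the linear relationship and the magnitude (between 0 and 1) gives its strength; neither says anything about causality.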


Scatter Plots... Footballs, Basketballs, 









Correlation Analysis Quantifies our 





Ordinary Least Squares Regression 


1 


• Using Simple Regression to Describe a 
Linear Relationship 

• Regression as a Descriptive Tool 

• Regression as a Forecasting/Prediction 
Tool 

• Making Inferences from Simple 
Regression 


Simple Linear Regression 


T 


Statistic used to describe linear 
relationships among variables: 


$y = b_0 + b_1 x$


Just like HS Algebra:
Y = m(X) + b


y = the dependent (outcome) variable
x = the independent (predictor, explanatory) variable
$b_0$ = the y-intercept of the graph of all (x, y) pairs
$b_1$ = the slope of the line


Consider a simple example of a 



Each green dot represents a person in the 
sample 



This 

person's 

Age 


X=Age (mo.) 






The OLS Regression Equation 

• A “best-fitting” line is calculated from the 
sample data 

• Can be used to predict Y, given X



Age (mo.) 



OLS Regression Lines Aren’t Perfect! 


• Some Predictions are too high: 


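Error = $y - \hat{y}$

[Figure: scatterplot of height against Age (mo.) with the fitted regression line; the line's equation is illegible in the extraction. The highlighted point's actual height (and age) lies below the predicted height (given age): a negative error.]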



OLS Regression Lines Aren’t Perfect! 


• Some Predictions are too low: 


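Error = $y - \hat{y}$

[Figure: scatterplot of height against Age (mo.) with the fitted regression line; the line's equation is illegible in the extraction. The highlighted point's actual height (and age) lies above the predicted height (given age): a positive error.]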



OLS Regression Lines Aren’t Perfect! 


• Some Predictions are too high: 


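Error = $y - \hat{y}$

[Figure: scatterplot of height against Age (mo.) with the fitted regression line; the line's equation is illegible in the extraction. The highlighted point's actual height (and age) lies below the predicted height (given age): a negative error.]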



Deriving the “Best-Fitting” Line 


Getting the best line is KEY! 

One Approach would be a line that Minimizes 
the Sum of the Errors: 


$\sum_{i=1}^{n} (y_i - \hat{y}_i)$

[Figure: scatterplot against Age (mo.)]


Why not minimize $\sum_{i=1}^{n} (y_i - \hat{y}_i)$?


Because positive errors cancel out negative errors when minimizing the sum of errors...

$\sum_{i=1}^{n} (y_i - \hat{y}_i) = 0$


...We need a way to account for the sign of the error in the minimization function.

...And Also... 


• ...An infinite number of lines meet this criterion, but only 1 is really “best”


Another Approach? 


• Minimize the Absolute Value of Errors?

$\sum_{i=1}^{n} |y_i - \hat{y}_i|$

Least Absolute Value (LAV)

• No unique LAV line, plus very complicated calculations


Third Approach? 


• Minimize the Squared errors?

$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

Least Squares Regression (LS)


Is LS Minimization Possible? 




Yes! 


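For a Linear Equation $y_i = b_0 + b_1 x_i$:

$b_1 \text{ (slope)} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$

$b_0 \text{ (intercept)} = \bar{y} - b_1 \bar{x}$

...simpler formula:

$b_1 = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n} \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2}$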


...a NOTE about formulae 


• Hopefully you’ll be able to recognize what 
they’re doing. 

• But don’t worry about memorizing them! 

• SPSS (or whatever your choice of 
software) will calculate the components of 
the OLS Regression Line for you anyway! 


Minimizing the SQUARED Errors 


1 


• Takes care of the problem of positive 
errors “canceling out” negative ones 

• Computationally simple enough for hand 
or computer calculations 


Creates a unique “line of best fit” 


Let’s do one by hand... 


i     x_i    y_i
1     1      3
2     2      2
3     3      8
4     4      8
5     5      11
6     6      13




Let’s do one by hand... 


i     x_i    y_i
1     1      3
2     2      2
3     3      8
4     4      8
5     5      11
6     6      13
sum   21     45

$b_1 = \frac{\sum x_i y_i - \frac{1}{n} \sum x_i \sum y_i}{\sum x_i^2 - \frac{1}{n} \left( \sum x_i \right)^2}$

The $b_1$ formula requires $\sum x$ and $\sum y$


Let’s do one by hand... 


i     x_i    y_i    x_i y_i    x_i^2
1     1      3      3          1
2     2      2      4          4
3     3      8      24         9
4     4      8      32         16
5     5      11     55         25
6     6      13     78         36
sum   21     45     196        91

$b_1 = \frac{\sum x_i y_i - \frac{1}{n} \sum x_i \sum y_i}{\sum x_i^2 - \frac{1}{n} \left( \sum x_i \right)^2}$

The $b_1$ formula requires these two columns also...


Let’s do one by hand... 


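i     x_i    y_i    x_i y_i    x_i^2
1     1      3      3          1
2     2      2      4          4
3     3      8      24         9
4     4      8      32         16
5     5      11     55         25
6     6      13     78         36
sum   21     45     196        91

$b_1 = \frac{\sum x_i y_i - \frac{1}{n} \sum x_i \sum y_i}{\sum x_i^2 - \frac{1}{n} \left( \sum x_i \right)^2}$

$b_0 = \bar{y} - b_1 \bar{x}$

$\bar{x} = \frac{21}{6} = 3.5$

$\bar{y} = \frac{45}{6} = 7.5$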



Using the formula... 


For a Linear Equation $y_i = b_0 + b_1 x_i$:

$b_1 = \frac{\sum x_i y_i - \frac{1}{n} \sum x_i \sum y_i}{\sum x_i^2 - \frac{1}{n} \left( \sum x_i \right)^2} = \frac{196 - \frac{1}{6}(21)(45)}{91 - \frac{1}{6}(21)^2} = \frac{38.5}{17.5} = 2.2$

$b_0 = \bar{y} - b_1 \bar{x} = 7.5 - 2.2(3.5) = -0.2$


Resulting in one, Unique LS 
Regression Line: 


$\hat{y} = -0.2 + 2.2x$


What does this line do for us?

- Given x = 15... Predict y: $\hat{y} = -0.2 + 2.2(15) = 32.8$


What can you say about the errors that 
you will make in your prediction? 



We've minimized our 
error in prediction 


Using LS Regression to Describe 
Relationships 


• Can be used to establish the weight 
assigned to various factors that may (or 
may not) predict some outcome 

- Ex. Assess the value of knowing an incoming student’s SAT Verbal score for predicting their first-term GPA, based on a sample of historical student application/enrollment data


$GPA_{first\ term} = 1.2 + .312(SAT_{verbal})$




Let’s run a Simple Regression using SPSS 



Description of Dataset NH SAT A630 

The dataset contains annual information on the set of New Hampshire students who have taken the 
Scholastic Aptitude Test (SAT) each year from 1976 through 1998. The variables in the dataset 
are defined as follows: 


• YEAR
• TOTAL = Total number of SAT takers in NH
• SATV = Average SAT-Verbal score
• SATM = Average SAT-Math score
• PCTDOCT = Percentage of SAT takers planning on pursuing a doctorate degree
• UNH = Number of SAT takers sending test scores to the U New Hampshire
• CPI = Consumer Price Index (1983 = 100)
• UNHPCT = % of total SAT takers sending test scores to U New Hampshire
• LUNHPCT = Natural logarithm of UNHPCT (=LN(UNHPCT))
• Resident = Resident tuition rate at U New Hampshire (in $1000s)
• Nonres = Nonresident tuition rate at U New Hampshire (in $1000s)
• Private = Average private tuition rate in New England (in $1000s)
• Income = Median family income (in $1000s)
• Lres = Natural logarithm of Resident (=LN(Resident))
• Lprivate = Natural logarithm of Private (=LN(Private))
• Lincome = Natural logarithm of Income (=LN(Income))
• Lnonres = Natural logarithm of Nonres (=LN(Nonres))


Let’s run a Simple Regression using SPSS 




• Let’s predict the number of SAT scores we 
might expect to be sent to UNH, so our 
Enrollment Management Office can do 
some strategic planning. 

-Y = UNH 

• Let’s use only 1 predictor — the total 
number of SAT test-takers 

- X1 = TOTAL

• Ok... This is a little boring, but bear with me!


SPSS Demo 




• Simple Regression — used to describe the 
relationship between total number of SAT 
test takers in NH, and the number of SAT 
scores sent to NH Admissions. 


Moving Beyond Describing
Relationships...


• Can we use Regression to “go beyond” 
what sample data suggest? 

• Can we make inferences about how two 
variables in the population might be 
related based on observations of sample 
data? 



Why would we want to do this?


• Why bother when most IR offices have access to 
your local “population” of students (ex. All 
current incoming freshmen)? 

- i.e. you don’t randomly sample your student data 
warehouse... you tap all data you have?! 

• Because you want to make inferences about the 
whole incoming class based on “today’s” data 

• Because you want to make inferences about 
next semester, next year, etc... 

• Because you want to do some strategic planning 
that could benefit by this kind of analysis 


Statistical Inferences require Assumptions 
about the Population 




• Given Population with X and Y 

$\mu_{y|x}$ = conditional mean of y, given x

$\mu_{y|x} = \beta_0 + \beta_1 x$, where

$\beta_0$ = population y-intercept, and

$\beta_1$ = slope of the population regression line.


We assume Population X/Y Relationship is Linear

[Figure: panels plotting Population y against Population x]


RE: The Population Conditional Mean...

$\mu_{y|x}$ = conditional mean of y, given x

[Figure: Population y plotted against Population x, marking the conditional means $\mu_{y|x}$ at x = 20, 30 and 40]

For given x, y varies, with a mean and variance ...leading to Prediction Error... or Disturbances... adding a term to our regression equation:


Resultant (Inferential) Regression 
Equation 


$y_i = \beta_0 + \beta_1 x_i + e_i$

• Where $e_i$ represents the difference between TRUE y and the conditional mean of all y|x

- Disturbances, or Error Variance

- If our equation were perfect, $e_i = 0$

• What type of relationship would have to occur 
for this to happen? 


Leading up to some ASSUMPTIONS for 
Inferential Regression 


1. Expected Value of disturbances = 0 ($E(e_i) = 0$)

a) i.e. The population regression is linear

2. Homogeneity of Variance for $e_i$

3. $e_i$ are normally distributed

4. $e_i$ are independent



Inferences About $\beta_0$ and $\beta_1$


• $b_0$ and $b_1$ are point estimates of $\beta_0$ and $\beta_1$

- I.e. they are calculated from samples, thus they are statistics

- Thus they are random variables with probability distributions (sampling distributions)

• Just as sample means are unbiased estimates of population means...

- $b_0$ and $b_1$ are unbiased estimates of $\beta_0$ and $\beta_1$

- We assume that their sampling distributions are normally distributed

- ...and as n increases, $b_0$ and $b_1$ become closer and closer to $\beta_0$ and $\beta_1$

- Of all possible estimators of $\beta_1$, $b_1$ has the desirable feature of having smaller sampling errors than any other unbiased estimator.


In other words... 


• $b_0$ and $b_1$ are Unbiased Estimators

- Mean of sampling distribution = Pop mean

• $b_0$ and $b_1$ are Consistent Estimators

- As n increases, estimator approaches parameter

• $b_0$ and $b_1$ are Minimum Variance Estimators

- While there are other unbiased estimators “out there,” $b_0$ and $b_1$ have the smallest variance


GIVEN all of these assumptions 


• We can make hypotheses about $\beta_0$ and $\beta_1$, and use $b_0$ and $b_1$ to test our hypotheses

- like we can about means or other statistics. 

• But we’re missing one remaining 
component of the Regression Equation. 

- Variance around the regression line, or Error 
Variance 

-We know our predictions aren’t perfect, but 
we need to quantify the “error in prediction” 


Estimating Variance around the 
Regression Line 








The estimate of Variance around the Regression Line:

$\hat{\sigma}_e^2 = s_e^2 = \frac{\sum (y_i - \hat{y}_i)^2}{n - 2} = \frac{SSE}{n - 2} = MSE$

SSE = sum of squared errors, adjusted by n - 2 df to incorporate sample size.

- df = (sample size - number of coefficients)

• We’re estimating two coefficients, $b_0$ and $b_1$

MSE = Mean Square Error (a mean square is any SS/df)

The square root of $s_e^2$ is termed the “Standard Error of Regression”
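As a sketch of these quantities, the code below computes SSE, MSE and the standard error of regression for the six-point hand example from earlier in the session, taking its fitted coefficients ($b_0 = -0.2$, $b_1 = 2.2$) as given. Variable names are illustrative.

```python
from math import sqrt

x = [1, 2, 3, 4, 5, 6]
y = [3, 2, 8, 8, 11, 13]
b0, b1 = -0.2, 2.2  # least-squares coefficients for these data

n = len(x)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

sse = sum(e * e for e in residuals)  # sum of squared errors
df_error = n - 2                     # two coefficients were estimated
mse = sse / df_error                 # estimate of variance around the line
s_e = sqrt(mse)                      # standard error of regression

print(round(sse, 4), round(mse, 4), round(s_e, 4))
```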


Back to our SPSS Example. . . 
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• Note the SSR, SSE and MS terms in the 
ANOVA Summary Table 

• Use ANOVA Table to Evaluate 
Significance of the Regression Equation 

- More important later, when we discuss 
multiple regression 

• Interpret R 2 


Hypothesis testing in Regression 


• We know that the LS estimates $b_0$ and $b_1$ are unbiased estimators of the population coefficients $\beta_k$

• We can calculate Confidence Interval Estimates (& significance values) of $\beta_k$

• And we test hypotheses that the population beta weights ($\beta_k$) are significantly different from some hypothesized value (usually 0)

- Reject the notion that $\beta_k$ = hypothesized value if observed t-score probability < alpha


The Theory Behind It? 


• Create Null and Alternative Hypotheses:

$H_0: \beta_k = \beta_k^*$ (usually 0)

• Calculate t-score(s) for coefficients

$t = \frac{b_k - \beta_k^*}{s_{b_k}} = \frac{b_k}{s_{b_k}}$ (when $\beta_k^* = 0$)

• Compare t-value to critical t (given alpha)

Reject $H_0$ if $t > t_{\alpha/2}$ or $t < -t_{\alpha/2}$

Accept $H_0$ if $-t_{\alpha/2} \le t \le t_{\alpha/2}$


A Closer look at t-formula

$H_0: \beta_k = \beta_k^*$ (usually 0)

• Calculate t-score(s) for coefficients

When null is true, t should be (large or small)? Why??
When Alternative is true, t should be (large or small)?


A Closer look at t-formula 


• Calculate t-score(s) for coefficients

$t = \frac{b_k - \beta_k^*}{s_{b_k}} = \frac{b_k}{s_{b_k}}$ (when $\beta_k^* = 0$)


NOTE that we are NOT comparing two sample means, even though we're using a t-test! We are comparing our observed $b_k$ to a constant that we choose ($\beta_k^*$, which is usually 0).



The “Typical” Situation 



• Testing Hypothesis that the coefficients 
are significantly higher or lower than 0 

$t = \frac{b_k - \beta_k^*}{s_{b_k}} = \frac{b_k - 0}{s_{b_k}} = \frac{b_k}{s_{b_k}}$

• If the $b_1$ coefficient = 0, what would that mean in terms of the Regression Equation?


If Null is Accepted

$H_0: \beta_k = \beta_k^*$ (usually 0)

$H_a: \beta_k \ne \beta_k^*$


• Small t-statistic, p-value is not < alpha.

- “X does not appear to be linearly related to Y”

• I.e. x doesn’t help you predict y



If Null is Rejected 


$H_0: \beta_k = \beta_k^*$ (usually 0)

$H_a: \beta_k \ne \beta_k^*$


• Large t-statistic, p-value is < alpha.

- “There is evidence that y and $x_k$ are linearly related, and that $x_k$ helps explain some of the variation in y (not accounted for by the other explanatory variables).”

• I.e. x is helpful in predicting y

• Parentheses used in multiple regression... coming 
soon! 
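The test of the slope can be sketched as follows for the six-point hand example from earlier in the session. Two things here are assumptions not taken from the slides: the standard error of the slope uses the textbook formula $s_{b_1} = s_e / \sqrt{\sum (x_i - \bar{x})^2}$, and the critical value 2.776 is the usual two-tailed table value for alpha = .05 with 4 df.

```python
from math import sqrt

x = [1, 2, 3, 4, 5, 6]
y = [3, 2, 8, 8, 11, 13]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s_e = sqrt(sse / (n - 2))   # standard error of regression
s_b1 = s_e / sqrt(sxx)      # standard error of the slope (textbook formula)

beta_star = 0.0             # hypothesized value under the null
t = (b1 - beta_star) / s_b1

T_CRIT = 2.776              # t-table value: alpha = .05, two-tailed, df = 4
reject_null = abs(t) > T_CRIT

print(round(t, 3), reject_null)
```

A statistics package would report the same t along with an exact p-value; comparing against a tabled critical value is the by-hand equivalent.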


How will we do this? 


• SPSS will calculate: 

- Some diagnostics that help us evaluate our 
assumptions about the data 

- Estimates of the coefficients 

- SD for the estimates of coefficients 

- T-scores comparing estimate to 0 

- p-values associated with the T-test 

• Humans will 

- Interpret the output in plain English! 

- Then explain it to constituents so that they can 
understand it too! 


Back to our SPSS Example 
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• Find the Table of Coefficients 

-T-values, p-values 

- Can you construct the linear equation? 

• Evaluate when our model is least & most 
accurate 

- Plot actual vs. predicted values 

- Calculate r, r 2 & Interpret 
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Simple vs. Multiple Regression 


• “Simple” refers to predicting a single 
outcome (y) from a single predictor (x) 

• “Multiple” refers to predicting a single 
outcome (y) from two or more predictors 
(x1, x2, x3)

- Still assuming a linear relationship 

• But there are ways to “coax” linearity if it’s not 
already there... 


Multiple Regression Examples 


• Ex. Predict Faculty Salary from Age, 
Department, Years as Faculty Member 
and Gender 

• Ex. Predict Student Performance on GE 
Outcomes from Cumulative GPA, College 
Major, and Gender 


Multiple Regression 


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• We’re still talking about Linear 

relationships

$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \ldots + b_k x_k$

• Still using method of Least Squares to 
develop the equation 

• Still estimating regression coefficients 
(betas) 


Multiple Regression 





• But Graphing the equation results in a 
plane, or more complex geometric shape, 
not a line, even though the relationships 
are still linear... 

- 3-D graphing? 

- ...or 2-D graphing? 


Multiple Regression Plane
(example w/2 predictors)

Predicting Sales from Price and Advertising

[Figure: fitted regression plane for Sales; Linear Regression with 95.00% Mean Prediction Interval]

Sales = 8.31 + -0.09 * price + 0.38 * advertising
R-Square = 0.45
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Assumptions for the Population Multiple 
Regression 




1. Expected value of disturbances is zero: $E(e_i) = 0$

2. Variance of each $e_i$ is equal to $\sigma_e^2$ (i.e. each disturbance along the regression line has equal variance regardless of value of x)

3. The $e_i$ are normally distributed.

4. The $e_i$ are independent. (Be careful RE longitudinal data... usually not independent.)




Assumptions for the Population Multiple 
Regression 


1. Expected value of disturbances is zero: $E(e_i) = 0$

2. Variance of each $e_i$ is equal to $\sigma_e^2$ (i.e. each disturbance along the regression line has equal variance regardless of value of x)

3. The $e_i$ are normally distributed.

4. The $e_i$ are independent. (Be careful RE longitudinal data... usually not independent.)

5. Predictors themselves are independent


One new assumption, because we have multiple predictors...



Assessing the FIT of a Multiple 
Regression Line 


• In Simple Regression, we mainly focused 
on discussing the significance of the 
regression coefficients. 

• In Multiple Regression, we must also pay 
attention to the overall Regression 
Equation? 

- Is it any good at predicting? 

- How do we know? 


Assessing the FIT of a Multiple 
Regression Line 


• With Simple Regression, we didn’t pay 
much attention to this. 

- If the coefficient was significant, that implied 
that the equation itself was too. 

• With Multiple Regression, we must first 
evaluate the overall equation before diving 
deeper. 

-Then determine which, if any coefficients are 
significant. 


Recall Hypothesis Testing with Simple 
Regression... 




Hypothesis Testing with Multiple- 
Regression 









The ANOVA Summary Table 


• Evaluation of the overall “Fit” of our 
Regression Equation 

• Do all coefficients = 0 (null hyp) or is at least one of them $\ne$ 0 (alt hyp).

• The results of the ANOVA are found in an ANOVA Summary Table...

Reject $H_0$ if $F > F(\alpha; K, n - K - 1)$

Accept $H_0$ if $F < F(\alpha; K, n - K - 1)$


The ANOVA Summary Table 


Source              df           SS     MS               F          p
Regression          K            SSR    SSR/K            MSR/MSE    p-value
Residual (Error)    n-K-1        SSE    SSE/(n-K-1)
Total               n-1          SST

(The SS column holds the Sums of Squares.)



ANOVA SS and MS terms... 


• The F-Ratio (MSR/MSE), and the associated p-value, tell us whether or not our regression equation is predicting a “significant” amount of the variance in y from knowledge of $x_1, x_2, \ldots, x_k$.

- If p-value < 0.05 (traditionally), equation is said to be “significant”


The R 2 Term 

(Generated from SS Terms) 


The variation in y: SST = SSR + SSE

- “Total SS = Regression SS + Error SS”

$R^2$, the ratio of explained-to-total variance (SS), is an evaluation of the overall regression.

$R^2 = \frac{SSR}{SST}$

I.e. “percent of variance accounted for”


ANOVA p-value VS. Multiple-R 2 


• ANOVA p-value tells us whether we can account for a significant proportion of variance in Y, by knowledge of all of the predictors ($X_1, X_2, \ldots, X_k$).

- F = MSR/MSE... associated with a p-value

• Multiple-$R^2$ is an estimate of how much variance we can account for in Y by knowledge of all of the predictors ($X_1, X_2, \ldots, X_k$).

- $R^2$ = SSR/SST
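To see how the two quantities come out of the same sums of squares, here is a sketch for the six-point hand example from earlier in the session (one predictor, so K = 1). Names are illustrative; the p-value itself would come from an F table or software and is not computed here.

```python
x = [1, 2, 3, 4, 5, 6]
y = [3, 2, 8, 8, 11, 13]
n, K = len(x), 1  # sample size and number of predictors

x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)               # total SS
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # error SS
ssr = sst - sse                                        # regression SS

msr = ssr / K
mse = sse / (n - K - 1)
F = msr / mse        # compare against F(alpha; K, n - K - 1)
r2 = ssr / sst       # proportion of variance accounted for

print(round(F, 3), round(r2, 4))
```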


The Multiple R-Square 
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The “Multiple-R 2 ” value is very similar to 
the correlation coefficient. 


$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$


• But in multiple-regression it has a flaw... 

- It doesn’t decrease as new predictors are 
added, even if they are “useless” additions. 



The Adjusted Multiple-R-Square 


We need to “adjust” the R 2 value to correct 
for the addition of more predictors 


$R^2 = 1 - \frac{SSE}{SST}$

$R^2_{adj} = 1 - \frac{SSE/(n - K - 1)}{SST/(n - 1)}$

(K = number of predictors)


Note how the SS in numerator and 
denominator are adjusted for their df? 


The Adjusted Multiple-R-Square 
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• This “adjustment” results in an adjusted R- 
square value that compensates for the 
number of predictors in the model. 

- No longer represents “the percent of variance 
accounted for” 

- But CAN BE used to compare different 
multiple-regression models 

• More on this later... 


Let’s Give it a Try, eh? 
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• SPSS Example of Multiple Regression 

- One Outcome Variable (number SAT scores sent to 
NH) 

- Several Predictors (Total SAT Takers, other 
predictors) 

• What’s the R 2 ? Adjusted R 2 ? 

• Is the equation itself any good? (i.e. can it 
account for sig. prop, of Y variance?) 

• Which, if any, of the predictors are useful? 

- Interpret 


Steps We’ll Take... 


1 . F-test for overall fit of regression 

2. If F-test is significant, examine the t-tests 
for each of the coefficients. 

3. Report the total percent variation in y 
explained by the x predictors 

4. Examine the Adjusted R2 




SPSS Example 


• Multiple Regression predicting Number of SAT scores sent to UNH, by

- Total number of test takers 

- Average SAT Verbal Score 

- Average SAT Math Score 
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Strategies when theory cannot guide 
you...
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• Thus far, the theory has been our guide on 
choosing predictors to consider 

- Theory is always the best strategy!! 

• Sometimes you may be on a “data mining” 
mission... 

• There are techniques that can help you 

- With Rob’s strong dose of caution! 


Statistical Strategies 
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• Selection algorithms: rules for deciding when 
to drop or add variables 

1 . Backwards Elimination 

2. Forward Selection 

3. Stepwise Regression 

4. Run All Possible Models 


Words of Caution 
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• None guarantee you get the right model because 
they do not check assumptions or search for omitted 
factors like curvature. 

• None have the ability to use a researcher's 
knowledge about the situation being analyzed. 

• Many among the scientific community do not respect 
statistical selection strategies like these because they 
are not grounded in theory, and they capitalize on 
sample variance relationships that may not exist in 
the population... 


Backwards Elimination 
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• Start with all variables in the equation. 

• Examine the variables in the model for 
significance and identify the least 
significant one. 

• Remove this variable if it does not meet 
some minimum significance level. 

• Run a new regression and repeat until all 
remaining variables are significant. 


Forward Selection 



• At each stage, it looks at the x variables not in the current equation and tests to see if they will be significant if they are added. (I.e. a significant partial-F statistic would result.)

• In the first stage, the x with the highest 
correlation with y is added. 

• At later stages it is much harder to see how the 
next x is selected. 


Stepwise Regression 


• A limitation with the backwards procedure is that 
a variable that gets eliminated is never 
considered again. 

• With forward selection, variables entering stay 
in, even if they lose significance. 

• Stepwise regression corrects these flaws. A 
variable entering can later leave. A variable 
eliminated can later go back in. 


Stepwise... 


• Begins like Forward (Chooses best predictor, adds 
it, tests for sig Partial-F, Keeps if pass criteria) 

• Next look at remaining predictors, choose “best” 
one (Highest Partial-F), and includes it. 

• Then behaves like Backwards... Potentially 
REMOVING one of the variables already included if 
it’s not necessary. 

• Then adds a new one... 

• Then tests to remove any of those included . . . 

• Until finished. 



Stepwise... 


• Ultimately, all variables in the equation are 
adding significantly, and none of the ones 
eliminated would (according to criteria we 
establish ahead of time) 

• It is possible to add a variable, remove it 
later, then add again at a later step! 


Example of using Stepwise with SPSS 


• Consider parameters for adding/removing 
variables 

- “Alpha to Remove”

• maximum p-value a variable can have and stay in the equation

- “Alpha to Enter”

• maximum p-value a variable can have and still enter the equation

• Often we use values like .15 or .20 because this 
encourages the procedures to look at models 
with more variables. 


SPSS Example 


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• Predict the number of SAT scores sent to 
NH 

- Using Stepwise Selection Technique 

• Including all (untransformed) predictors in the data 
set 


All Possible Models? 


• If reasonable, this is likely a better solution 
than Stepwise, but... 

- Some software (SPSS) cannot easily 
accommodate 

- Can be unreasonable if there are many 
potential predictors 

- Still not as good as theory 

• ex. what if a non-linear transformation is really the 
driver? 

• Model selection usually based on Adjusted 
R 2 for OLS regression 


Related Advanced Topics 
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• Regression with Bivariate or Multinomial 
Outcomes 

- Logit Procedure 

• Quite different in output and interpretation 

- Logistic Regression 

• Useful for >2 categories to the outcome variable 

• Regression for Ordinal outcomes 

- Poisson, Negative Binomial, Others 

• Hierarchical (nested) Regressive models 

- “Block” Regression in SPSS 

- Used to compare models with increasing 
complexity... 


New Advances 




• Hierarchical Linear Modeling (HLM) 

-A.k.a. MLM, Mixed Modeling 

• Allows for nesting to be considered 

• Allows both fixed and random effects 

• Allows for time-dependent covariates

• Allows for group-level effects modeling 
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Time Series Regression 


• Some data are linear in their relationship, 
but have “cycles” that we’d like to capture 
also 

- Ex. New Admits over time 

• Where predictable cycles exist among Fall, Spring, 
Summer terms 


Consider this time series

[Figure: Number of Applications Rec'd (y-axis) plotted against Observation (x-axis)]

20 years of data! Number of applications rec'd since 1985, where each observation is a semester.

• Beginning with Fall, 1985

• Then Spring, 1986

• Followed by Summer, 1986...

And so on.



We could fit a linear regression to the
data... R² = .71


But if we could “capture” the cycle of
FA/SP/SU wouldn’t that be better?




How? 




• Include “dummy” predictors indicating 
whether data is from a Fall, Spring, or 
Summer term 

- Use 2 of the 3, leaving one as “reference” 

• Run Multiple Regression: 
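The dummy-coding step can be sketched as follows. This is an illustration on synthetic data, not the applications data from the slides: the coefficients used to generate y, the noise level, and all variable names are assumptions. Summer is left out as the reference term, so the Fall and Spring coefficients are read as differences from Summer.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumption: 62 semester observations, cycling Fall, Spring, Summer.
n_obs = 62
obs = np.arange(1, n_obs + 1)
term = np.array(["FA", "SP", "SU"])[(obs - 1) % 3]

# Synthetic outcome: linear trend + seasonal shift + noise (made-up values).
shift = {"FA": 4500.0, "SP": 1600.0, "SU": 0.0}
y = (-1300.0 + 185.0 * obs
     + np.array([shift[t] for t in term])
     + rng.normal(scale=1000.0, size=n_obs))

# Two dummies for three terms; Summer is the reference category.
fall = (term == "FA").astype(float)
spring = (term == "SP").astype(float)


def fit_ols(design, y):
    """Return OLS coefficients and R^2 for a design matrix with intercept."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
    return beta, float(r2)


ones = np.ones(n_obs)
beta_simple, r2_simple = fit_ols(np.column_stack([ones, obs]), y)
beta_season, r2_season = fit_ols(np.column_stack([ones, obs, fall, spring]), y)

print(f"trend only:        R2 = {r2_simple:.3f}")
print(f"trend + FA/SP:     R2 = {r2_season:.3f}")
print("coefficients (const, obs, fall, spring):", np.round(beta_season, 1))
```

Using all three dummies together with the intercept would make the design matrix perfectly collinear, which is why only 2 of the 3 are entered.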




Comparison of the two models:


Simple Regression Model

Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .841    .707       .702                2161.135

ANOVA

Model          Sum of Squares   df   Mean Square    F         Sig.
1 Regression   6.8E+008         1    677290217.4    145.014   .000
  Residual     2.8E+008         60   4670502.918
  Total        9.6E+008         61

Coefficients

                Unstandardized Coefficients   Standardized Coefficients
Model           B            Std. Error       Beta                        t        Sig.
1 (Constant)    761.925      555.637                                      1.371    .175
  Observation   184.692      15.337           .841                        12.042   .000

a. Predictors: (Constant), Observation
b. Dependent Variable: Number of Applications Rec’d


Time Series Regression
with Seasonal Predictors

Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .964    .930       .926                1075.387

ANOVA

Model          Sum of Squares   df   Mean Square    F         Sig.
1 Regression   8.9E+008         3    296815296.0    256.659   .000
  Residual     67074505         58   1156456.975
  Total        9.6E+008         61

Coefficients

                Unstandardized Coefficients   Standardized Coefficients
Model           B            Std. Error       Beta                        t        Sig.
1 (Constant)    -1352.942    340.067                                      -3.978   .000
  Observation   186.214      7.634            .848                        24.393   .000
  Fall          4490.993     336.016          .541                        13.365   .000
  Spring        1611.276     336.016          .194                        4.795    .000

a. Predictors: (Constant), Spring, Observation, Fall
b. Dependent Variable: Number of Applications Rec’d


Is there anything else we could use in our 
data??? 







Modeling Policy Changes... 

• Simply add a dummy predictor that 
captures the policy change! 


- ... Though it may be tempting, please make no assumptions
about the fact that “policy change” and “dummy” are in the same
sentence. :-)


Variables Entered/Removed

Model   Variables Entered                                            Variables Removed   Method
1       New Admission Standards (FA98), Spring, Fall, Observation                        Enter

a. All requested variables entered.
b. Dependent Variable: Number of Applications Rec’d


Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .965    .931       .926                1080.052

ANOVA

Model          Sum of Squares   df   Mean Square    F         Sig.
1 Regression   8.9E+008         4    222757289.6    190.960   .000
  Residual     66491234         57   1166512.879
  Total        9.6E+008         61

Coefficients

                                   Unstandardized Coefficients   Standardized Coefficients
Model                              B            Std. Error       Beta                        t        Sig.
1 (Constant)                       -1219.940    389.910                                      -3.129   .003
  Observation                      177.909      14.027           .810                        12.684   .000
  Fall                             4475.465     338.188          .539                        13.234   .000
  Spring                           1604.053     337.628          .193                        4.751    .000
  New Admission Standards (FA98)   367.505      519.725          .045                        .707     .482

a. Predictors: (Constant), New Admission Standards (FA98), Spring, Fall, Observation
b. Dependent Variable: Number of Applications Rec’d


In this case, the policy
change did not
significantly impact
admits after modeling
time and cycles of
semesters...
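As a cross-check on the conclusion about the policy dummy, the two nested models can be compared directly from the numbers printed in their ANOVA and Coefficients tables. The sketch below uses only values reported in the SPSS output; the variable names are illustrative. For a single added predictor, the F statistic for the change should equal the square of that predictor's t statistic.

```python
# Residual sums of squares and residual df as reported in the ANOVA tables.
ss_res_reduced, df_reduced = 67074505.0, 58   # Observation, Fall, Spring
ss_res_full, df_full = 66491234.0, 57         # adds New Admission Standards (FA98)

# Partial F test for the one added predictor.
extra_df = df_reduced - df_full
f_change = ((ss_res_reduced - ss_res_full) / extra_df) / (ss_res_full / df_full)

# t statistic of the policy dummy: B / Std. Error from the Coefficients table.
t_policy = 367.505 / 519.725

# Total sum of squares rebuilt from the seasonal model (df * mean square + residual).
ss_total = 3 * 296815296.0 + ss_res_reduced
ms_total = ss_total / 61

adj_r2_reduced = 1.0 - (ss_res_reduced / df_reduced) / ms_total
adj_r2_full = 1.0 - (ss_res_full / df_full) / ms_total

print(f"F change = {f_change:.3f}, t^2 = {t_policy ** 2:.3f}")
print(f"adjusted R2: without policy = {adj_r2_reduced:.4f}, with policy = {adj_r2_full:.4f}")
```

Both adjusted R² values round to .926, and the one with the policy dummy is marginally lower, which agrees with the non-significant coefficient (Sig. = .482).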


Results?


[Model Summary, ANOVA, and Coefficients tables repeated from the previous slide; Adjusted R Square = .926]

And the adjusted R² is
similar to the simpler
model that did not
include the policy
change predictor.


Principal-Axis Exploratory Factor
Analysis


• Common in survey research 

• Useful for “discovery” of underlying 
constructs 

• Useful as a strategy for condensing data 

• Useful as a strategy to approximate 
continuous data from ordinal data 
elements 

- Combining several Likert-Scaled items into 
one construct score that behaves “normally” 


What can EFA do?

• Purpose is to discover simple patterns
among variables

• If patterns are found, we call them
“factors,” or “constructs,” hence the name

• EX: Is intelligence uni-dimensional, or 
multi-dimensional? 

- Hint: Consider College Board Exams... 

• Verbal, Math, Logic 


EFA Graphic (after solution has been rendered) 


Goals of EFA

• To better understand the underlying
constructs.

- Thus inferences are made about the
constructs, not individual items


[Figure: diagram labeled Factor 1, Factor 2, Factor 3]


How it works


• Begins with a simple correlation matrix

- In fact, you don’t need raw data in some cases
(rotation)


• EFA attempts to “categorize” variables according 
to how similar/dissimilar they are to other 
variables 


- By calculating factor loadings.... 

• Goal is to produce the minimum number of 
factors that adequately explains the data 


Some Details

• As always, there are options for how to run
EFA, and there are detailed references
available

• “Rotation” options to simplify our
understanding of factor structure

- Orthogonal (more complex structure, but
more independence of factors)

- Oblique (simpler structure, but factors may
correlate more)


Rotation

• Procedure (ex. Varimax) that searches for
linear combinations (i.e. rotations) of
original factors so that the variance of the
squared loadings is maximized



Varimax Rotation 


• Varimax Rotation is probably the most 
popular choice for EFA (Kaiser, 1958) 

- Each factor should have a small number of
items loading heavily on it

• Each variable should load mostly onto only one 
factor 

• Thus simplifying our understanding of underlying 
constructs 


Wine Tasting Example

• Factor Rotations in Factor Analyses

- Hervé Abdi, University of Texas at Dallas

• Five Wines are Rated by Seven Questions


Table 1: An (artificial) example for PCA and rotation. Five wines are described
by seven variables.

         Hedonic   For meat   For dessert   Price   Sugar         Alcohol   Acidity
Wine 1   14        7          8             7       7             13        7
Wine 2   10        7          6             4       3             14        7
Wine 3   8         5          5             10      5             12        5
Wine 4   2         4          7             16      [illegible]   11        3
Wine 5   6         2          4             13      3             10        3


Unrotated Two-Factor Solution: See any
patterns?


[Figure: the seven variables (Hedonic, Acidity, Alcohol, For meat, Price, For dessert, Sugar) plotted on the first two unrotated components]


Table 2: Wine example: Original loadings of the seven variables on the first two
components.

           Hedonic   For meat   For dessert   Price     Sugar     Alcohol   Acidity
Factor 1   -0.3965   -0.4454    -0.2646       0.4160    -0.0485   -0.4385   -0.4547
Factor 2   0.1149    -0.1090    -0.5854       -0.3111   -0.7245   0.0555    0.0865

Varimax Rotated Solution: See any
patterns?


[Figure: the same seven variables with the axes rotated by θ = 15°]


Table 3: Wine example: Loadings, after VARIMAX rotation, of the seven variables
on the first two components.

           Hedonic   For meat   For dessert   Price     Sugar     Alcohol   Acidity
Factor 1   -0.4125   -0.4057    -0.1147       0.4790    0.1286    -0.4389   -0.4620
Factor 2   0.0153    -0.2138    -0.6321       -0.2010   -0.7146   -0.0525   -0.0264


• Seems likely that there 
are two dimensions 

- One factor of sweetness 

- The other linked to price 
and complex taste qualities 
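The rotation step can be sketched with a compact varimax routine applied to the unrotated wine loadings. This is a commonly used SVD-based formulation of raw varimax written from scratch for illustration (no Kaiser normalization, function names are assumptions); it is not the code behind the published tables, so the output may differ from Table 3 in sign, column order, or the last decimals.

```python
import numpy as np


def varimax(loadings, max_iter=100, tol=1e-8):
    """Raw varimax rotation; returns (rotated loadings, rotation matrix)."""
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    rotation = np.eye(k)
    crit_old = 0.0
    for _ in range(max_iter):
        Lr = L @ rotation
        grad = L.T @ (Lr ** 3 - Lr @ np.diag((Lr ** 2).sum(axis=0)) / p)
        u, s, vt = np.linalg.svd(grad)
        rotation = u @ vt
        crit = s.sum()
        if crit_old != 0.0 and crit / crit_old < 1.0 + tol:
            break
        crit_old = crit
    return L @ rotation, rotation


def varimax_criterion(L):
    """Sum over factors of the variance of the squared loadings."""
    return float(np.var(np.asarray(L) ** 2, axis=0).sum())


variables = ["Hedonic", "For meat", "For dessert", "Price",
             "Sugar", "Alcohol", "Acidity"]

# Unrotated loadings from Table 2 (rows = variables, columns = Factor 1, 2).
unrotated = np.array([
    [-0.3965,  0.1149],
    [-0.4454, -0.1090],
    [-0.2646, -0.5854],
    [ 0.4160, -0.3111],
    [-0.0485, -0.7245],
    [-0.4385,  0.0555],
    [-0.4547,  0.0865],
])

rotated, rotation = varimax(unrotated)

for name, row in zip(variables, rotated):
    print(f"{name:12s} {row[0]: .4f} {row[1]: .4f}")
print("criterion before:", round(varimax_criterion(unrotated), 4))
print("criterion after: ", round(varimax_criterion(rotated), 4))
```

Because the rotation is orthogonal, each variable's communality (its row sum of squared loadings) is unchanged; only how that variance is split between the two factors changes.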


Next Example using SPSS for EFA

• Data set provided by Mary Ann Coughlin

- Editor & Co-Author of AIR
Intermediate/Advanced Statistics in IR
Monograph

- Twenty-seven-item questionnaire of
graduates from a small liberal arts college.

- Asking about things they gained from their
education


[Screenshot: SPSS Data Editor, Variable View]

Twenty-Seven Questions that tap into perceived benefits of
an education...

Ripe for EFA!

#    Name       Label
1    gender     Gender of Subject
2    writ_sc    Write effectively
3    comm_sc    Communicate well orally
4    acqu_sc    Acquire new skills and knowledge on my own
5    thin_sc    Think analytically and logically
6    form_sc    Formulate creative / original ideas and solutions
7    eval_sc    Evaluate and choose between alternative courses
8    lead_sc    Lead and supervise tasks and groups of people
9    rel_sc     Relate well to people of different races, nations
10   func_sc    Function effectively as a member of a team
11   comb_sc    Use computers for basic tasks (word processing)
12   comc_sc    Use computers for complex tasks (graphing)
13   prob_sc    Place current problems in historical perspective
14   mor_sc     Identify moral and ethical issues
15   und_sc     Understand myself, my abilities, interests
16   ind_sc     Function independently without supervision
17   dept_sc    Gain in-depth knowledge of a field
18   comp_sc    Plan and execute complex projects
19   forl_sc    Read or speak a foreign language
20   art_sc     Appreciate art, literature, music, drama
21   brod_sc    Acquire broad knowledge in the Arts and Sciences
22   fem_sc     Develop feminist awareness
23   soc_sc     Develop awareness of social problems
24   self_sc    Develop self-esteem / self-confidence
25   frnd_sc    Form close friendships
26   goal_sc    Establish a course of action to accomplish goals
27   synt_sc    Synthesize and integrate ideas and information
28   sci_sc     Understand the role of science and technology
29   ltdgpa     Grade Point Average
30   verb_sat   Verbal SAT Score
31   math_sat   Math SAT Score
32   comb_sat
33   div        Division of primary major field of study

All variables are Numeric. The survey items (2 to 28) are coded 0 to 3, with 0 = Not at all.


How would you summarize the data? 


• Table of Means? 

• Histograms? 


Table of Descriptive Statistics


Descriptive Statistics

                                                    N     Range   Minimum   Maximum   Mean   Std. Error   Std. Deviation
Write effectively                                   621   3       0         3         2.46   .026         .648
Communicate well orally                             618   3       0         3         2.21   .031         .779
Acquire new skills and knowledge on my own          614   3       0         3         2.45   .029         .714
Think analytically and logically                    616   3       0         3         2.40   .029         .716
Formulate creative / original ideas and solutions   611   3       0         3         2.15   .032         .803
Evaluate and choose between alternative courses     608   3       0         3         2.01   .033         .816
Lead and supervise tasks and groups of people       611   3       0         3         1.89   .038         .949
Relate well to people of different races, nations   615   3       0         3         2.23   .035         .873
Function effectively as a member of a team          610   3       0         3         1.91   .035         .860
Use computers for basic tasks (word processing)     611   3       0         3         2.02   .042         1.045
Use computers for complex tasks (graphing)          610   3       0         3         1.00   .045         1.115
Place current problems in historical perspective    609   3       0         3         2.23   .034         .835
Identify moral and ethical issues                   612   3       0         3         2.09   .034         .842
Understand myself, my abilities, interests          614   3       0         3         2.41   .032         .800
Function independently without supervision          612   3       0         3         2.26   .035         .870
Gain in-depth knowledge of a field                  614   3       0         3         2.38   .030         .744
Plan and execute complex projects                   604   3       0         3         2.11   .033         .809
Read or speak a foreign language                    609   3       0         3         1.38   .048         1.196
Appreciate art, literature, music, drama            611   3       0         3         2.12   .036         .899
Acquire broad knowledge in the Arts and Sciences    614   3       0         3         2.19   .032         .796
Develop feminist awareness                          613   3       0         3         2.56   .029         .710
Develop awareness of social problems                610   3       0         3         2.36   .030         .747
Develop self-esteem / self-confidence               614   3       0         3         2.35   .035         .859
Form close friendships                              617   3       0         3         2.46   .032         .803
Establish a course of action to accomplish goals    614   3       0         3         2.13   .031         .780
Synthesize and integrate ideas and information      612   3       0         3         2.30   .029         .720
Understand the role of science and technology       614   3       0         3         1.55   .039         .956
Valid N (listwise)                                  537
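The Std. Error column of a descriptives table like this one is the standard error of the mean, i.e. the standard deviation divided by the square root of N. A short check on four of the reported rows (values copied from the table; the dictionary layout is just for illustration):

```python
import math

# (N, Std. Deviation, reported Std. Error) for four items from the table.
rows = {
    "Write effectively": (621, 0.648, 0.026),
    "Communicate well orally": (618, 0.779, 0.031),
    "Acquire new skills and knowledge on my own": (614, 0.714, 0.029),
    "Use computers for complex tasks (graphing)": (610, 1.115, 0.045),
}

computed = {}
for label, (n, sd, se_reported) in rows.items():
    computed[label] = sd / math.sqrt(n)   # standard error of the mean
    print(f"{label:45s} computed {computed[label]:.4f}  reported {se_reported:.3f}")
```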









Bucket-o-Histograms


[Figures: histograms of Write effectively; Acquire new skills and knowledge on my own; Think analytically and logically; Formulate creative / original ideas and solutions; etc.]














What’s the Big Picture? 




• Simple to get the “take-home” message 
from our graduates? 

- Generally happy 

- Simple Descriptives can tell us that. 

• But what’s the “big picture” of the benefits 
of a college education from our college, in 
the eyes of our graduates? 

- This kind of question is ripe for EFA 


Run Exploratory Factor Analysis

- Principal-Axis Method

- With Varimax Rotation

Interpret the Results!


Six Underlying Constructs Resulting from our
Rotated Factor Structure


Rotated Factor Matrix

                                                    Factor
                                                    1      2       3       4      5       6
Think analytically and logically                    .643   .150    .091    .006   .118    -.026
Formulate creative / original ideas and solutions   .625   .146    .145    .199   .031    .124
Synthesize and integrate ideas and information      .587   .202    .217    .113   .208    .072
Acquire new skills and knowledge on my own          .581   .070    .141    .090   .095    .146
Plan and execute complex projects                   .537   .042    .127    .148   .230    .168
Write effectively                                   .512   .263    .094    .027   -.007   .142
Establish a course of action to accomplish goals    .496   .195    .386    .223   .204    .112
Evaluate and choose between alternative courses     .485   .175    .176    .397   .087    .067
Communicate well orally                             .460   .119    .265    .207   .037    .168
Gain in-depth knowledge of a field                  .442   .074    .080    .039   .160    .117
Develop awareness of social problems                .158   .732    .233    .091   .102    .099
Develop feminist awareness                          .082   .560    .148    .025   -.037   .158
Identify moral and ethical issues                   .288   .542    .168    .185   .098    .126
Place current problems in historical perspective    .303   .433    .040    .166   -.063   .198
Form close friendships                              .122   .163    .569    .143   .077    .084
Understand myself, my abilities, interests          .328   .275    .547    .159   -.056   .156
Develop self-esteem / self-confidence               .370   .313    .543    .165   .096    .162
Function independently without supervision          .372   .100    .455    .170   .165    .189
Relate well to people of different races, nations   .125   .335    .337    .307   .165    .070
Lead and supervise tasks and groups of people       .178   .089    .189    .724   .149    .048
Function effectively as a member of a team          .160   .190    .208    .567   .240    .038
Use computers for complex tasks (graphing)          .082   -.034   -.024   .112   .601    -.030
Understand the role of science and technology       .380   -.013   .089    .138   .571    .164
Use computers for basic tasks (word processing)     .122   .118    .247    .108   .402    .053
Appreciate art, literature, music, drama            .178   .248    .141    .088   -.054   .650
Acquire broad knowledge in the Arts and Sciences    .269   .124    .064    .050   .277    .413
Read or speak a foreign language                    .076   .079    .078    .003   .038    .402

Extraction Method: Principal Axis Factoring.
Rotation Method: Varimax with Kaiser Normalization.

a. Rotation converged in 7 iterations.
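One way to read a rotated factor matrix is to assign each item to the factor on which it loads most heavily and to flag items that also load noticeably elsewhere. The sketch below does this for the loadings reported above; the .30 cut-off for a "cross-loading" is a common rule of thumb assumed here for illustration, not something taken from the slides.

```python
from collections import Counter

# Rotated loadings (Factors 1-6) copied from the Rotated Factor Matrix.
loadings = {
    "Think analytically and logically": [.643, .150, .091, .006, .118, -.026],
    "Formulate creative / original ideas and solutions": [.625, .146, .145, .199, .031, .124],
    "Synthesize and integrate ideas and information": [.587, .202, .217, .113, .208, .072],
    "Acquire new skills and knowledge on my own": [.581, .070, .141, .090, .095, .146],
    "Plan and execute complex projects": [.537, .042, .127, .148, .230, .168],
    "Write effectively": [.512, .263, .094, .027, -.007, .142],
    "Establish a course of action to accomplish goals": [.496, .195, .386, .223, .204, .112],
    "Evaluate and choose between alternative courses": [.485, .175, .176, .397, .087, .067],
    "Communicate well orally": [.460, .119, .265, .207, .037, .168],
    "Gain in-depth knowledge of a field": [.442, .074, .080, .039, .160, .117],
    "Develop awareness of social problems": [.158, .732, .233, .091, .102, .099],
    "Develop feminist awareness": [.082, .560, .148, .025, -.037, .158],
    "Identify moral and ethical issues": [.288, .542, .168, .185, .098, .126],
    "Place current problems in historical perspective": [.303, .433, .040, .166, -.063, .198],
    "Form close friendships": [.122, .163, .569, .143, .077, .084],
    "Understand myself, my abilities, interests": [.328, .275, .547, .159, -.056, .156],
    "Develop self-esteem / self-confidence": [.370, .313, .543, .165, .096, .162],
    "Function independently without supervision": [.372, .100, .455, .170, .165, .189],
    "Relate well to people of different races, nations": [.125, .335, .337, .307, .165, .070],
    "Lead and supervise tasks and groups of people": [.178, .089, .189, .724, .149, .048],
    "Function effectively as a member of a team": [.160, .190, .208, .567, .240, .038],
    "Use computers for complex tasks (graphing)": [.082, -.034, -.024, .112, .601, -.030],
    "Understand the role of science and technology": [.380, -.013, .089, .138, .571, .164],
    "Use computers for basic tasks (word processing)": [.122, .118, .247, .108, .402, .053],
    "Appreciate art, literature, music, drama": [.178, .248, .141, .088, -.054, .650],
    "Acquire broad knowledge in the Arts and Sciences": [.269, .124, .064, .050, .277, .413],
    "Read or speak a foreign language": [.076, .079, .078, .003, .038, .402],
}

CROSS_LOADING_CUTOFF = 0.30  # assumption: conventional rule of thumb

# Factor (1-6) with the largest absolute loading for each item.
primary = {item: max(range(6), key=lambda j: abs(vals[j])) + 1
           for item, vals in loadings.items()}

# Items whose second-largest absolute loading also reaches the cut-off.
cross_loaded = [item for item, vals in loadings.items()
                if sorted((abs(v) for v in vals), reverse=True)[1] >= CROSS_LOADING_CUTOFF]

items_per_factor = Counter(primary.values())
for factor in sorted(items_per_factor):
    print(f"Factor {factor}: {items_per_factor[factor]} items")
print("Items with a secondary loading >= .30:", len(cross_loaded))
```

Items flagged as cross-loading (for example "Relate well to people of different races, nations") are the ones whose placement under a single construct deserves a second look.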




Six Underlying Constructs Resulting from our
Rotated Factor Structure


[Figure: the Rotated Factor Matrix with groups of items annotated by construct labels: Knowledge Gains, Social/Moral, Reasoning, Self-Awareness/Independence, Technology, Arts & Humanities]
Arts & Humanities 



Advanced Topics Related to EFA 


• Confirmatory Factor Analysis

- Testing hypotheses about factor structure

• Structural Equation Modeling

- Testing structure on the “predictor” and
“outcome” side of the equations 

- Path-Analysis-Like model fitting 

- Mediator/Moderator effects testing 

- Just to name a few examples! 


Foundations II Institute: The Advanced Practice 

of Institutional Research 


Questions? Comments? 
Course Evaluations Please!!! 


Association for
Institutional Research


Robert Ploutz-Snyder, Ph.D.

Biostatistician NASA JSC
USRA / Division of Space Life Sciences,
Research Associate Professor of Medicine
SUNY Upstate Medical University


What topics needed more info?
Less info? If more, what
should I eliminate??
